In this notebook, we will do a comprehensive analysis of the Android app market by comparing thousands of apps in the Google Play store.
Data Source:
App and review data was scraped from the Google Play Store by Lavanya Gupta in 2018. Original files listed here.
import pandas as pd
import numpy as np
from IPython.display import display, HTML # so we can use .display() instead of print(), and render HTML as well
# Show numeric output in decimal format e.g., 2.15
pd.options.display.float_format = '{:,.2f}'.format # only works for float though, no thousand separator for int
# Code if we want to format integers with a thousand separator
num_format = lambda x: '{:,}'.format(x)
def build_formatters(df, format):
return {
column:format
for column, dtype in df.dtypes.items()
if dtype in [ np.dtype('int64'), np.dtype('int32') ]
}
df_apps = pd.read_csv('apps.csv')
Challenge: How many rows and columns does df_apps have? What are the column names? Look at a random sample of 5 different rows with .sample().
df_apps.shape
(10841, 12)
df_apps.columns
Index(['App', 'Category', 'Rating', 'Reviews', 'Size_MBs', 'Installs', 'Type',
'Price', 'Content_Rating', 'Genres', 'Last_Updated', 'Android_Ver'],
dtype='object')
df_apps.sample(5) # to show a random sample, instead of .head() always showing the top 5
| App | Category | Rating | Reviews | Size_MBs | Installs | Type | Price | Content_Rating | Genres | Last_Updated | Android_Ver | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9787 | FOX | ENTERTAINMENT | 3.70 | 197774 | 44.00 | 10,000,000 | Free | 0 | Mature 17+ | Entertainment | June 25, 2018 | 4.4 and up |
| 1757 | Tricky Bike Stunt Rider DX | FAMILY | 4.40 | 16 | 45.00 | 500 | Free | 0 | Everyone | Simulation;Education | July 29, 2018 | 4.0 and up |
| 2223 | BF Beautiful Nature | FAMILY | 3.60 | 5 | 2.50 | 1,000 | Free | 0 | Mature 17+ | Entertainment | June 11, 2015 | 3.0 and up |
| 155 | ear super hearing | MEDICAL | NaN | 0 | 1.40 | 5 | Free | 0 | Everyone | Medical | July 31, 2018 | 4.0 and up |
| 7324 | Identity V | GAME | 4.20 | 109263 | 38.00 | 1,000,000 | Free | 0 | Teen | Action | July 10, 2018 | 4.1 and up |
Challenge: Remove the columns called Last_Updated and Android_Version from the DataFrame. We will not use these columns.
df_apps.drop(columns=['Last_Updated', 'Android_Ver'], inplace=True)
df_apps.sample(5)
| App | Category | Rating | Reviews | Size_MBs | Installs | Type | Price | Content_Rating | Genres | |
|---|---|---|---|---|---|---|---|---|---|---|
| 6697 | Network Signal Info | TOOLS | 4.10 | 33926 | 17.00 | 1,000,000 | Free | 0 | Everyone | Tools |
| 7342 | Learn JavaScript | EDUCATION | 4.60 | 25183 | 5.40 | 1,000,000 | Free | 0 | Everyone | Education |
| 10666 | Garena Free Fire | GAME | 4.50 | 5534114 | 53.00 | 100,000,000 | Free | 0 | Teen | Action |
| 9877 | Rolly Vortex | GAME | 4.40 | 98123 | 40.00 | 10,000,000 | Free | 0 | Teen | Arcade |
| 9664 | Operate Now: Hospital | FAMILY | 4.30 | 254861 | 52.00 | 10,000,000 | Free | 0 | Teen | Simulation |
Challenge: How may rows have a NaN value (not-a-number) in the Ratings column? Create DataFrame called df_apps_clean that does not include these rows.
df_apps.Rating.isna().sum()
1474
1474 rows have NaN as rating value.
df_apps.loc[df_apps.Rating.isna()].sample(5)
| App | Category | Rating | Reviews | Size_MBs | Installs | Type | Price | Content_Rating | Genres | |
|---|---|---|---|---|---|---|---|---|---|---|
| 483 | FP Live | COMMUNICATION | NaN | 0 | 3.30 | 10 | Free | 0 | Teen | Communication |
| 219 | DS-Admin | FAMILY | NaN | 1 | 21.00 | 10 | Free | 0 | Everyone | Education |
| 2078 | AX-PIC Pedidos | TOOLS | NaN | 1 | 3.40 | 1,000 | Free | 0 | Everyone | Tools |
| 2267 | Fahrschule DW | FAMILY | NaN | 5 | 3.90 | 1,000 | Free | 0 | Everyone | Education |
| 1201 | AO-EVENT | BUSINESS | NaN | 0 | 42.00 | 100 | Free | 0 | Everyone | Business |
We see that NaN values in ratings are associated with low reviews and lows installs. That makes sense.
df_apps_clean = df_apps.dropna()
df_apps_clean.shape
(9367, 10)
9367 + 1474 = 10841 rows initially -> OK
Challenge: Are there any duplicates in data? Check for duplicates using the .duplicated() function. How many entries can you find for the "Instagram" app? Use .drop_duplicates() to remove any duplicates from df_apps_clean.
df_apps_clean.duplicated().tail()
10836 False 10837 False 10838 False 10839 True 10840 False dtype: bool
print(f"{df_apps_clean.duplicated().sum()} entries are pure duplicates.")
476 entries are pure duplicates.
df_apps_clean.loc[df_apps_clean.duplicated()].sort_values(by='App')
| App | Category | Rating | Reviews | Size_MBs | Installs | Type | Price | Content_Rating | Genres | |
|---|---|---|---|---|---|---|---|---|---|---|
| 6371 | 10 Best Foods for You | HEALTH_AND_FITNESS | 4.00 | 2490 | 3.80 | 500,000 | Free | 0 | Everyone 10+ | Health & Fitness |
| 7159 | 1800 Contacts - Lens Store | MEDICAL | 4.70 | 23160 | 26.00 | 1,000,000 | Free | 0 | Everyone | Medical |
| 2604 | 2017 EMRA Antibiotic Guide | MEDICAL | 4.40 | 12 | 3.80 | 1,000 | Paid | $16.99 | Everyone | Medical |
| 5140 | 21-Day Meditation Experience | HEALTH_AND_FITNESS | 4.40 | 11506 | 15.00 | 100,000 | Free | 0 | Everyone | Health & Fitness |
| 946 | 420 BZ Budeze Delivery | MEDICAL | 5.00 | 2 | 11.00 | 100 | Free | 0 | Mature 17+ | Medical |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9635 | theScore: Live Sports Scores, News, Stats & Vi... | SPORTS | 4.40 | 133833 | 34.00 | 10,000,000 | Free | 0 | Everyone 10+ | Sports |
| 9634 | theScore: Live Sports Scores, News, Stats & Vi... | SPORTS | 4.40 | 133833 | 34.00 | 10,000,000 | Free | 0 | Everyone 10+ | Sports |
| 9632 | theScore: Live Sports Scores, News, Stats & Vi... | SPORTS | 4.40 | 133825 | 34.00 | 10,000,000 | Free | 0 | Everyone 10+ | Sports |
| 10156 | trivago: Hotels & Travel | TRAVEL_AND_LOCAL | 4.20 | 219848 | 12.00 | 50,000,000 | Free | 0 | Everyone | Travel & Local |
| 10159 | trivago: Hotels & Travel | TRAVEL_AND_LOCAL | 4.20 | 219848 | 12.00 | 50,000,000 | Free | 0 | Everyone | Travel & Local |
476 rows × 10 columns
-> Careful there, even if a row is seen only once when displaying the duplicates, it is actually at least twice in the original df.
By default, each column must be identical for .duplicated() to consider two rows duplicates. But we can consider only some columns. E.g. same App name, same Type and same Price
print(f"In this case, {df_apps_clean.duplicated(subset=['App', 'Type', 'Price']).sum()} entries are duplicates.")
In this case, 1168 entries are duplicates.
df_apps_clean = df_apps_clean.drop_duplicates(subset=['App', 'Type', 'Price']) # same subset used for drop_duplicates()
df_apps_clean.duplicated().any()
False
df_apps_clean.shape
(8199, 10)
Duplicates removed, now left with 8199 rows.
So we can see that 13 different features were originally scraped from the Google Play Store.
Obviously, the data is just a sample out of all the Android apps. It doesn't include all Android apps of which there are millions.
I’ll assume that the sample is representative of the App Store as a whole. This is not necessarily the case as, during the web scraping process, this sample was served up based on geographical location and user behaviour of the person who scraped it - in our case Lavanya Gupta.
The data was compiled around 2017/2018. The pricing data reflect the price in USD Dollars at the time of scraping. (developers can offer promotions and change their app’s pricing).
I’ve converted the app’s size to a floating-point number in MBs. If data was missing, it has been replaced by the average size for that category.
The installs are not the exact number of installs. If an app has 245,239 installs then Google will simply report an order of magnitude like 100,000+. I’ve removed the '+' and we’ll assume the exact number of installs in that column for simplicity.
Here’s what you would see under an Android app listing if you go to a listing on the Google Play Store:

Challenge: Identify which apps are the highest rated. What problem might you encounter if you rely exclusively on ratings alone to determine the quality of an app?
df_apps_clean.Rating.max()
5.0
df_apps_clean.loc[df_apps_clean.Rating == 5].App.count()
271
271 Apps have the highest rating of 5
df_apps_clean.loc[df_apps_clean.Rating == 5].sort_values('Reviews').head()
| App | Category | Rating | Reviews | Size_MBs | Installs | Type | Price | Content_Rating | Genres | |
|---|---|---|---|---|---|---|---|---|---|---|
| 1441 | DH-UFO | FAMILY | 5.00 | 1 | 59.00 | 100 | Free | 0 | Everyone | Entertainment |
| 310 | DT CLOTHINGS | SHOPPING | 5.00 | 1 | 7.90 | 10 | Free | 0 | Everyone | Shopping |
| 312 | 30WPM Amateur ham radio Koch CW Morse code tra... | FAMILY | 5.00 | 1 | 3.70 | 10 | Paid | $1.49 | Everyone | Education |
| 1272 | CG Prints | PHOTOGRAPHY | 5.00 | 1 | 2.30 | 100 | Free | 0 | Everyone | Photography |
| 321 | FK CLASSIC FOR YOU | BUSINESS | 5.00 | 1 | 3.50 | 10 | Free | 0 | Everyone | Business |
A lot of them have a 5.0 rating but have only one review!
Challenge: What's the size in megabytes (MB) of the largest Android apps in the Google Play Store. Based on the data, do you think there could be limit in place or can developers make apps as large as they please?
df_apps_clean.Size_MBs.max()
100.0
df_apps_clean.loc[df_apps_clean.Size_MBs == 100].App.count()
14
14 apps have the same largest size value (100 MB), this must be the limit in place.
Challenge: Which apps have the highest number of reviews? Are there any paid apps among the top 50?
df_apps_clean.sort_values('Reviews', ascending=False)[:5]
| App | Category | Rating | Reviews | Size_MBs | Installs | Type | Price | Content_Rating | Genres | |
|---|---|---|---|---|---|---|---|---|---|---|
| 10805 | SOCIAL | 4.10 | 78158306 | 5.30 | 1,000,000,000 | Free | 0 | Teen | Social | |
| 10785 | WhatsApp Messenger | COMMUNICATION | 4.40 | 69119316 | 3.50 | 1,000,000,000 | Free | 0 | Everyone | Communication |
| 10806 | SOCIAL | 4.50 | 66577313 | 5.30 | 1,000,000,000 | Free | 0 | Teen | Social | |
| 10784 | Messenger – Text and Video Chat for Free | COMMUNICATION | 4.00 | 56642847 | 3.50 | 1,000,000,000 | Free | 0 | Everyone | Communication |
| 10650 | Clash of Clans | GAME | 4.60 | 44891723 | 98.00 | 100,000,000 | Free | 0 | Everyone 10+ | Strategy |
top_50_reviews = df_apps_clean.sort_values('Reviews', ascending=False)[:50]
top_50_reviews.loc[top_50_reviews.Type == 'Paid']
| App | Category | Rating | Reviews | Size_MBs | Installs | Type | Price | Content_Rating | Genres |
|---|
print(f"Are there any paid apps among the top 50? {top_50_reviews.loc[top_50_reviews.Type == 'Paid'].App.any()}")
Are there any paid apps among the top 50? False
First, we’ll count the number of occurrences of each Content Rating:
ratings = df_apps_clean.Content_Rating.value_counts().to_frame()
ratings
| Content_Rating | |
|---|---|
| Everyone | 6621 |
| Teen | 912 |
| Mature 17+ | 357 |
| Everyone 10+ | 305 |
| Adults only 18+ | 3 |
| Unrated | 1 |
The first step in creating charts with plotly is to import plotly.express. This is the fastest way to create a beautiful graphic with a minimal amount of code in plotly.
import plotly.express as px
ratings.rename(columns={'Content_Rating': 'App_Count'}, inplace=True)
ratings.index.name = 'Content_Rating'
ratings
| App_Count | |
|---|---|
| Content_Rating | |
| Everyone | 6621 |
| Teen | 912 |
| Mature 17+ | 357 |
| Everyone 10+ | 305 |
| Adults only 18+ | 3 |
| Unrated | 1 |
# fig = px.pie(ratings, values='App_Count', names='Content_Rating') # doesn't work -> "not the name of a column"
fig = px.pie(ratings, values='App_Count', names=ratings.index)
fig.show()
For further tuning, we call fig.update_traces to set other parameters of the chart (you can also use fig.update_layout for changing the layout).
Examples:
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.update_traces(textposition='outside', textinfo='percent+label')
See the Plotly pie documentation and search for "update_traces" to see all possible **kwargs for fig.update_traces
fig = px.pie(ratings, values='App_Count', names=ratings.index, title='App by Content Rating', height=500) # adding title and height
fig.update_traces(textposition='outside', textinfo='percent+label')
fig.show()
fig = px.pie(ratings, values='App_Count', names=ratings.index, title='App by Content Rating', height=500,
hole=0.6)
fig.update_traces(textposition='inside', textinfo='percent', textfont_size=15) # changing the style
fig.show()
Challenge: How many apps had over 1 billion (that's right - BILLION) installations? How many apps just had a single install?
Check the datatype of the Installs column.
Count the number of apps at each level of installations.
Convert the number of installations (the Installs column) to a numeric data type. Hint: this is a 2-step process. You'll have make sure you remove non-numeric characters first.
df_apps_clean.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 8199 entries, 21 to 10835 Data columns (total 10 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 App 8199 non-null object 1 Category 8199 non-null object 2 Rating 8199 non-null float64 3 Reviews 8199 non-null int64 4 Size_MBs 8199 non-null float64 5 Installs 8199 non-null object 6 Type 8199 non-null object 7 Price 8199 non-null object 8 Content_Rating 8199 non-null object 9 Genres 8199 non-null object dtypes: float64(2), int64(1), object(7) memory usage: 704.6+ KB
I could also use the function .describe() to get information on a Series (e.g. df_apps_clean.Installs.describe())
df_apps_clean.Installs
21 1
28 1
47 1
82 5
99 5
...
10824 1,000,000,000
10828 1,000,000,000
10829 1,000,000,000
10831 1,000,000,000
10835 1,000,000,000
Name: Installs, Length: 8199, dtype: object
type(df_apps_clean.Installs[28])
str
The "Installs" columns contains strings.
count_installs = df_apps_clean.Installs.value_counts().to_frame(name='App_Count')
count_installs.index.name = 'Nb_of_Installs'
count_installs.sample(3)
| App_Count | |
|---|---|
| Nb_of_Installs | |
| 10,000 | 988 |
| 1 | 3 |
| 500,000 | 504 |
# No need for a for loop, just using .str allows us to use string methods on a Series
count_installs.index = count_installs.index.str.replace(',', '')
count_installs.sample(3)
| App_Count | |
|---|---|
| Nb_of_Installs | |
| 100000 | 1096 |
| 10000000 | 933 |
| 5000000 | 607 |
count_installs.index = pd.to_numeric(count_installs.index)
count_installs.sort_values('Nb_of_Installs', inplace=True)
count_installs.head()
| App_Count | |
|---|---|
| Nb_of_Installs | |
| 1 | 3 |
| 5 | 9 |
| 10 | 69 |
| 50 | 56 |
| 100 | 303 |
Alternatively, instead of using df_apps_clean.Installs.value_counts() to get the number of apps at each level of installations, we could also have used .groupby().count():
count_installs_v2 = df_apps_clean[['App', 'Installs']].groupby('Installs').count()
count_installs_v2.head()
| App | |
|---|---|
| Installs | |
| 1 | 3 |
| 1,000 | 698 |
| 1,000,000 | 1417 |
| 1,000,000,000 | 20 |
| 10 | 69 |
Also, another method exists to convert a Series from one type to another: .astype()
type(count_installs.index[5])
numpy.int64
type(count_installs_v2.index[5])
str
type(count_installs.index.astype(str)[5])
str
type(count_installs_v2.index.astype(str).str.replace(",", "").astype(int)[5])
numpy.int64
So essentially, we can use Series.astype(str) to convert it to Strings. And we can also use Series.astype(int) instead of pd.to_numeric(Series)
df_apps_clean.Installs = df_apps_clean.Installs.str.replace(",", "").astype(int)
type(df_apps_clean.Installs[82])
numpy.int32
df_apps_clean.sample(2)
| App | Category | Rating | Reviews | Size_MBs | Installs | Type | Price | Content_Rating | Genres | |
|---|---|---|---|---|---|---|---|---|---|---|
| 3608 | BJJ Roadmap by Stephan Kesting | SPORTS | 4.60 | 969 | 3.60 | 10000 | Free | 0 | Everyone | Sports |
| 3195 | kick the buddy | GAME | 3.30 | 294 | 19.00 | 10000 | Free | 0 | Everyone | Arcade |
When displaying the df, we don't have any thousand separators for the integer values (Reviews and Installs). So we can use this code:
df_apps_clean.Installs = df_apps_clean.Installs.map('{:,}'.format)
It works well, and displays as it should. But then, we're dealing with string values again. It's possible to create a variable dedicated for display only though.
Otherwise we can also use this method (formatters initialised in Notebook-Presentation)
formatters = build_formatters(df_apps_clean, num_format) # We build the formatter with the df
hmtl_code = df_apps_clean.to_html(formatters=formatters) # We create the HTML code to display the table the way we want
HTML(html_code) # We use the HTML function to render the code
This last method will display the WHOLE table though... Note: HTML code can used to display the table in a website for instance.
Let's examine the Price column more closely.
Challenge: Convert the price column to numeric data. Then investigate the top 20 most expensive apps in the dataset.
Remove all apps that cost more than $250 from the df_apps_clean DataFrame.
Add a column called 'Revenue_Estimate' to the DataFrame. This column should hold the price of the app times the number of installs. What are the top 10 highest grossing paid apps according to this estimate? Out of the top 10 highest grossing paid apps, how many are games?
First, we convert the Price into floats
df_apps_clean.Price = pd.to_numeric(df_apps_clean.Price.str.replace("$", "", regex=False))
-> See pandas .str.replace() Documentation about the regex argument (if not set to False or True, I get a warning)
Then investigate the top 20 most expensive apps in the dataset.
df_apps_clean.sort_values('Price', ascending=False)[:20]
| App | Category | Rating | Reviews | Size_MBs | Installs | Type | Price | Content_Rating | Genres | |
|---|---|---|---|---|---|---|---|---|---|---|
| 3946 | I'm Rich - Trump Edition | LIFESTYLE | 3.60 | 275 | 7.30 | 10000 | Paid | 400.00 | Everyone | Lifestyle |
| 2461 | I AM RICH PRO PLUS | FINANCE | 4.00 | 36 | 41.00 | 1000 | Paid | 399.99 | Everyone | Finance |
| 4606 | I Am Rich Premium | FINANCE | 4.10 | 1867 | 4.70 | 50000 | Paid | 399.99 | Everyone | Finance |
| 3145 | I am rich(premium) | FINANCE | 3.50 | 472 | 0.94 | 5000 | Paid | 399.99 | Everyone | Finance |
| 3554 | 💎 I'm rich | LIFESTYLE | 3.80 | 718 | 26.00 | 10000 | Paid | 399.99 | Everyone | Lifestyle |
| 5765 | I am rich | LIFESTYLE | 3.80 | 3547 | 1.80 | 100000 | Paid | 399.99 | Everyone | Lifestyle |
| 1946 | I am rich (Most expensive app) | FINANCE | 4.10 | 129 | 2.70 | 1000 | Paid | 399.99 | Teen | Finance |
| 2775 | I Am Rich Pro | FAMILY | 4.40 | 201 | 2.70 | 5000 | Paid | 399.99 | Everyone | Entertainment |
| 3221 | I am Rich Plus | FAMILY | 4.00 | 856 | 8.70 | 10000 | Paid | 399.99 | Everyone | Entertainment |
| 3114 | I am Rich | FINANCE | 4.30 | 180 | 3.80 | 5000 | Paid | 399.99 | Everyone | Finance |
| 1331 | most expensive app (H) | FAMILY | 4.30 | 6 | 1.50 | 100 | Paid | 399.99 | Everyone | Entertainment |
| 2394 | I am Rich! | FINANCE | 3.80 | 93 | 22.00 | 1000 | Paid | 399.99 | Everyone | Finance |
| 3897 | I Am Rich | FAMILY | 3.60 | 217 | 4.90 | 10000 | Paid | 389.99 | Everyone | Entertainment |
| 2193 | I am extremely Rich | LIFESTYLE | 2.90 | 41 | 2.90 | 1000 | Paid | 379.99 | Everyone | Lifestyle |
| 3856 | I am rich VIP | LIFESTYLE | 3.80 | 411 | 2.60 | 10000 | Paid | 299.99 | Everyone | Lifestyle |
| 2281 | Vargo Anesthesia Mega App | MEDICAL | 4.60 | 92 | 32.00 | 1000 | Paid | 79.99 | Everyone | Medical |
| 1407 | LTC AS Legal | MEDICAL | 4.00 | 6 | 1.30 | 100 | Paid | 39.99 | Everyone | Medical |
| 2629 | I am Rich Person | LIFESTYLE | 4.20 | 134 | 1.80 | 1000 | Paid | 37.99 | Everyone | Lifestyle |
| 2481 | A Manual of Acupuncture | MEDICAL | 3.50 | 214 | 68.00 | 1000 | Paid | 33.99 | Everyone | Medical |
| 4264 | Golfshot Plus: Golf GPS | SPORTS | 4.10 | 3387 | 25.00 | 50000 | Paid | 29.99 | Everyone | Sports |
Many ridiculous "I am rich" apps at the top. We remove all apps that cost more than $250.
df_apps_clean = df_apps_clean.loc[df_apps_clean.Price < 250]
df_apps_clean.sort_values('Price', ascending=False)[:5]
| App | Category | Rating | Reviews | Size_MBs | Installs | Type | Price | Content_Rating | Genres | |
|---|---|---|---|---|---|---|---|---|---|---|
| 2281 | Vargo Anesthesia Mega App | MEDICAL | 4.60 | 92 | 32.00 | 1000 | Paid | 79.99 | Everyone | Medical |
| 1407 | LTC AS Legal | MEDICAL | 4.00 | 6 | 1.30 | 100 | Paid | 39.99 | Everyone | Medical |
| 2629 | I am Rich Person | LIFESTYLE | 4.20 | 134 | 1.80 | 1000 | Paid | 37.99 | Everyone | Lifestyle |
| 2481 | A Manual of Acupuncture | MEDICAL | 3.50 | 214 | 68.00 | 1000 | Paid | 33.99 | Everyone | Medical |
| 2463 | PTA Content Master | MEDICAL | 4.20 | 64 | 41.00 | 1000 | Paid | 29.99 | Everyone | Medical |
When we look at the top 5 most expensive apps now, we see that 4 out of 5 are medical apps.
df_apps_clean['Revenue_Estimate'] = df_apps_clean.Price * df_apps_clean.Installs
We could also use the .mul() function to multiply two Series together: df_apps_clean['Revenue_Estimate'] = df_apps_clean.Installs.mul(df_apps_clean.Price)
df_apps_clean.sort_values('Revenue_Estimate', ascending=False)[:10]
| App | Category | Rating | Reviews | Size_MBs | Installs | Type | Price | Content_Rating | Genres | Revenue_Estimate | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 9220 | Minecraft | FAMILY | 4.50 | 2376564 | 19.00 | 10000000 | Paid | 6.99 | Everyone 10+ | Arcade;Action & Adventure | 69,900,000.00 |
| 8825 | Hitman Sniper | GAME | 4.60 | 408292 | 29.00 | 10000000 | Paid | 0.99 | Mature 17+ | Action | 9,900,000.00 |
| 7151 | Grand Theft Auto: San Andreas | GAME | 4.40 | 348962 | 26.00 | 1000000 | Paid | 6.99 | Mature 17+ | Action | 6,990,000.00 |
| 7477 | Facetune - For Free | PHOTOGRAPHY | 4.40 | 49553 | 48.00 | 1000000 | Paid | 5.99 | Everyone | Photography | 5,990,000.00 |
| 7977 | Sleep as Android Unlock | LIFESTYLE | 4.50 | 23966 | 0.85 | 1000000 | Paid | 5.99 | Everyone | Lifestyle | 5,990,000.00 |
| 6594 | DraStic DS Emulator | GAME | 4.60 | 87766 | 12.00 | 1000000 | Paid | 4.99 | Everyone | Action | 4,990,000.00 |
| 6082 | Weather Live | WEATHER | 4.50 | 76593 | 4.75 | 500000 | Paid | 5.99 | Everyone | Weather | 2,995,000.00 |
| 7954 | Bloons TD 5 | FAMILY | 4.60 | 190086 | 94.00 | 1000000 | Paid | 2.99 | Everyone | Strategy | 2,990,000.00 |
| 7633 | Five Nights at Freddy's | GAME | 4.60 | 100805 | 50.00 | 1000000 | Paid | 2.99 | Teen | Action | 2,990,000.00 |
| 6746 | Card Wars - Adventure Time | FAMILY | 4.30 | 129603 | 23.00 | 1000000 | Paid | 2.99 | Everyone 10+ | Card;Action & Adventure | 2,990,000.00 |
This is the top 10 highest-grossing paid apps. We see that 7 out the top 10 highest-grossing apps are games.
We can find the number of different categories like so:
df_apps_clean.Category.nunique()
33
top10_category = df_apps_clean.Category.value_counts()[:10].to_frame()
top10_category.rename(columns={'Category': 'Count'}, inplace=True)
top10_category.index.name = 'Category'
top10_category
| Count | |
|---|---|
| Category | |
| FAMILY | 1606 |
| GAME | 910 |
| TOOLS | 719 |
| PRODUCTIVITY | 301 |
| PERSONALIZATION | 298 |
| LIFESTYLE | 297 |
| FINANCE | 296 |
| MEDICAL | 292 |
| PHOTOGRAPHY | 263 |
| BUSINESS | 262 |
bar = px.bar(top10_category, x=top10_category.index, y='Count', height=500)
bar.show()
Based on the number of apps, the Family and Game categories are the most competitive. Releasing yet another app into these categories will make it hard to get noticed.Based on the number of apps, the Family and Game categories are the most competitive. Releasing yet another app into these categories will make it hard to get noticed.
But what if we look at it from a different perspective? What matters is not just the total number of apps in the category but how often apps are downloaded in that category. This will give us an idea of how popular a category is. First, we have to group all our apps by category and sum the number of installations:
category_installs = df_apps_clean[['Category', 'Installs']].groupby('Category').sum()
category_installs.sort_values('Installs', inplace=True)
category_installs[:3]
| Installs | |
|---|---|
| Category | |
| EVENTS | 15949410 |
| BEAUTY | 26916200 |
| PARENTING | 31116110 |
We could also have used the .agg() function:
category_installs = df_apps_clean.groupby('Category').agg({'Installs': 'sum'})
category_installs.sort_values('Installs', ascending=True, inplace=True)
category_installs[:2]
| Installs | |
|---|---|
| Category | |
| EVENTS | 15949410 |
| BEAUTY | 26916200 |
We can now create a horizontal bar chart, simply by adding the orientation parameter:
# Careful, x and y are reversed here
h_bar = px.bar(category_installs, x='Installs', y=category_installs.index, orientation='h', height=500,
title='Category Popularity') # custom title
h_bar.update_layout(xaxis_title='Number of Downloads', yaxis_title='Category') # custom axis labels
h_bar.show()
Now we see that Games and Tools are actually the most popular categories. If we plot the popularity of a category next to the number of apps in that category we can get an idea of how concentrated a category is. Do few apps have most of the downloads or are the downloads spread out over many apps?
Challenge:
First, create a DataFrame that has the number of apps in one column and the number of installs in another:
Then use the plotly express examples from the documentation alongside the .scatter() API referenceto create scatter plot that looks like this.
Hint: Use the size, hover_name and color parameters in .scatter(). To scale the yaxis, call .update_layout() and specify that the yaxis should be on a log-scale like so: yaxis=dict(type='log')
cat_number = df_apps_clean.groupby('Category').agg({'App': 'count', 'Installs': 'sum'})
cat_number.sort_values('Installs', ascending=False, inplace=True)
cat_number[:6]
| App | Installs | |
|---|---|---|
| Category | ||
| GAME | 910 | 13858762717 |
| COMMUNICATION | 257 | 11039241530 |
| TOOLS | 719 | 8099724500 |
| PRODUCTIVITY | 301 | 5788070180 |
| SOCIAL | 203 | 5487841475 |
| PHOTOGRAPHY | 263 | 4649143130 |
scatter = px.scatter(cat_number, x='App', y='Installs',
size='App', color='Installs', height=600, title='Category Concentration',
log_y=True,
hover_name=cat_number.index) # so that the Category name appears when we hover over a dot.
scatter.update_layout(xaxis_title='Number of Apps (Lower=More Concentrated)', yaxis_title='Installs')
# scatter.update_layout(yaxis=dict(type='log')) Also possible to use this for the log axis
scatter.show()
What we see is that the categories like Family, Tools, and Game have many different apps sharing a high number of downloads. But for the categories like video players and entertainment, all the downloads are concentrated in very few apps.

Challenge: How many different types of genres are there? Can an app belong to more than one genre? Check what happens when you use .value_counts() on a column with nested values? See if you can work around this problem by using the .split() function and the DataFrame's .stack() method.
df_apps_clean.Genres.nunique()
114
If we look at the number of unique values in the Genres column we get 114. But this is not accurate if we have nested data like we do here. We can see this using .value_counts() and looking at the values that just have a single entry. There we see that the semi-colon (;) separates the genre names.
df_apps_clean.Genres.value_counts().to_frame()
| Genres | |
|---|---|
| Tools | 718 |
| Entertainment | 467 |
| Education | 429 |
| Productivity | 301 |
| Personalization | 298 |
| ... | ... |
| Adventure;Brain Games | 1 |
| Travel & Local;Action & Adventure | 1 |
| Art & Design;Pretend Play | 1 |
| Music & Audio;Music & Video | 1 |
| Lifestyle;Pretend Play | 1 |
114 rows × 1 columns
We somehow need to separate the genre names to get a clear picture. This is where the string’s .split() method comes in handy, with expand=True
stack_init = df_apps_clean.Genres.str.split(";", regex=False, expand=True)
stack_init.rename(columns={0: 'Genre 1', 1: 'Genre 2'}, inplace=True)
stack_init
| Genre 1 | Genre 2 | |
|---|---|---|
| 21 | Medical | None |
| 28 | Arcade | None |
| 47 | Arcade | None |
| 82 | Arcade | None |
| 99 | Medical | None |
| ... | ... | ... |
| 10824 | Productivity | None |
| 10828 | Video Players & Editors | None |
| 10829 | Video Players & Editors | None |
| 10831 | News & Magazines | None |
| 10835 | Arcade | None |
8184 rows × 2 columns
From one Series, we create a DataFrame which splitted the Genre in two, now we have Genre 1 and Genre 2.
After we’ve separated our genre names based on the semi-colon, we can add them all into a single column with .stack() and then use .value_counts().
.stack() -> Stack the prescribed level(s) from columns to index.
stack = stack_init.stack().to_frame()
stack.rename(columns={0: 'Genre_Name'}, inplace=True)
stack.sample(5)
| Genre_Name | ||
|---|---|---|
| 3792 | Genre 1 | Photography |
| 8926 | Genre 1 | Sports |
| 1790 | Genre 1 | Events |
| 663 | Genre 2 | Education |
| 7720 | Genre 1 | Productivity |
Now we've got all Genres, 1 and 2, listed under one same column, that we can name "Genre_Name".
With this data, we can now really count the occurence of each Genre.
num_genres = stack.value_counts().to_frame().rename(columns={0: "Nb_of_Apps"})
num_genres.sample(5)
| Nb_of_Apps | |
|---|---|
| Genre_Name | |
| Comics | 54 |
| Trivia | 28 |
| Social | 203 |
| Role Playing | 111 |
| Lifestyle | 298 |
print(f"We have {len(num_genres.Nb_of_Apps)} existing Genres.")
We have 53 existing Genres.
print(f'The "num_genres" dataframe index has {num_genres.index.nlevels} levels.')
The "num_genres" dataframe index has 1 levels.
Only 1 level index but I get an error in Plotly:
Argument 'x' is a pandas MultiIndex. pandas MultiIndex is not supported by plotly express.
So, we reset the index to avoid any issue.
num_genres = num_genres.reset_index()
num_genres.sample(3)
| Genre_Name | Nb_of_Apps | |
|---|---|---|
| 41 | Comics | 54 |
| 12 | Communication | 258 |
| 30 | Food & Drink | 94 |
Challenge: Can you create this chart with the Series containing the genre data?
Try experimenting with the built in colour scales in Plotly. You can find a full list here.
bar = px.bar(num_genres[:15], x='Genre_Name', y='Nb_of_Apps', height=500, color='Nb_of_Apps', title='Top Genres',
color_continuous_scale='Agsunset') # using a specific color scale
bar.update_layout(xaxis_title='Genre', yaxis_title='Number of Apps',
coloraxis_showscale=False) # hiding the legend/color axis on the right of the figure
bar.show()
Now that we’ve looked at the total number of apps per category and the total number of apps per genre, let’s see what the split is between free and paid apps.
df_apps_clean.Type.value_counts().to_frame()
| Type | |
|---|---|
| Free | 7595 |
| Paid | 589 |
We see that the majority of apps are free on the Google Play Store. But perhaps some categories have more paid apps than others. Let’s investigate. We can group our data first by Category and then by Type. Then we can add up the number of apps per each type.
df_free_vs_paid = df_apps_clean[['Category', 'Type']].groupby('Category').value_counts().to_frame()
df_free_vs_paid.rename(columns={0: 'Nb_of_Apps'}, inplace=True)
df_free_vs_paid
| Nb_of_Apps | ||
|---|---|---|
| Category | Type | |
| ART_AND_DESIGN | Free | 58 |
| Paid | 3 | |
| AUTO_AND_VEHICLES | Free | 72 |
| Paid | 1 | |
| BEAUTY | Free | 42 |
| ... | ... | ... |
| TRAVEL_AND_LOCAL | Paid | 8 |
| VIDEO_PLAYERS | Free | 144 |
| Paid | 4 | |
| WEATHER | Free | 65 |
| Paid | 7 |
61 rows × 1 columns
The function .groupby() can actually take more than one column as argument:
df_apps_clean[['Category', 'Type']].groupby(["Category", "Type"]).value_counts().to_frame()
or using the agg() is similar but saves us some code here:
df_apps_clean.groupby(["Category", "Type"]).agg({'App': 'count'})[:4]
| App | ||
|---|---|---|
| Category | Type | |
| ART_AND_DESIGN | Free | 58 |
| Paid | 3 | |
| AUTO_AND_VEHICLES | Free | 72 |
| Paid | 1 |
Finally, by using as_index=False in groupby(), we push all the data into columns rather than end up with our Categories as the index.
df_free_vs_paid = df_apps_clean.groupby(["Category", "Type"], as_index=False).agg({'App': pd.Series.count})
df_free_vs_paid[:4]
| Category | Type | App | |
|---|---|---|---|
| 0 | ART_AND_DESIGN | Free | 58 |
| 1 | ART_AND_DESIGN | Paid | 3 |
| 2 | AUTO_AND_VEHICLES | Free | 72 |
| 3 | AUTO_AND_VEHICLES | Paid | 1 |
Using as_index=False we push all the data into columns rather than end up with our Categories as the index.
Challenge: Use the plotly express bar chart examples and the .bar() API reference to create this bar chart:
You'll want to use the df_free_vs_paid DataFrame that you created above that has the total number of free and paid apps per category.
See if you can figure out how to get the look above by changing the categoryorder to 'total descending' as outlined in the documentation here here.
g_bar = px.bar(df_free_vs_paid, x='Category', y='App', title='Free vs Paid Apps by Category', height=500,
color='Type', log_y=True) # color for each type, and log y
g_bar.update_layout(xaxis_title='Category', yaxis_title='Number of Apps',
barmode='group') # bars grouped instead of stacked
g_bar.update_xaxes(categoryorder='total descending') # arranging the bars by order
g_bar.show()
What we see is that while there are very few paid apps on the Google Play Store, some categories have relatively more paid apps than others, including Personalization, Medical and Weather. So, depending on the category you are targeting, it might make sense to release a paid-for app.
But this leads to many more questions:
How much should you charge? What are other apps charging in that category?
How much revenue could you make?
And how many downloads are you potentially giving up because your app is paid?
Let’s try and answer these questions with some Box plots. Box plots show us some handy descriptive statistics in a graph - things like the median value, the maximum value, the minimum value, and some quartiles.
Challenge: Create a box plot that shows the number of Installs for free versus paid apps. How does the median number of installations compare? Is the difference large or small?
Use the Box Plots Guide and the .box API reference to create the following chart.
box = px.box(df_apps_clean, x='Type', y='Installs', color='Type', log_y=True, height=500,
points='all', notched=True, # points -> all, and notched=True
title='How Many Downloads are Paid Apps Giving Up?')
box.show()
From the hover text in the chart, we see that the median number of downloads for free apps is 500,000, while the median number of downloads for paid apps is around 5,000! This is massively lower.
But does this mean we should give up on selling a paid app? Let’s see how much revenue we would estimate per category.
Challenge: See if you can generate this chart:
Looking at the hover text, how much does the median app earn in the Tools category? If developing an Android app costs $30,000 or thereabouts, does the average photography app recoup its development costs?
Hint: I've used 'min ascending' to sort the categories.
df_paid_apps = df_apps_clean.loc[df_apps_clean.Type == 'Paid']
box = px.box(df_paid_apps, x='Category', y='Revenue_Estimate', title='How Much Can Paid Apps Earn?', log_y=True, height=500)
box.update_layout(yaxis_title='Paid App Ballpark Revenue')
box.update_xaxes(categoryorder='min ascending') # special ordering of the boxes -> look it up for all the category orders
box.show()
Looking at the hover text, how much does the median app earn in the Tools category? If developing an Android app costs \$30,000 or thereabouts, does the average photography app recoup its development costs?
If an Android app costs \$30,000 to develop, then the average app in very few categories would cover that development cost. The median paid photography app earned about \\$20,000. Many more app’s revenues were even lower - meaning they would need other sources of revenue like advertising or in-app purchases to make up for their development costs. However, certain app categories seem to contain a large number of outliers that have much higher (estimated) revenue - for example in Medical, Personalisation, Tools, Game, and Family.

So, if you were to list a paid app, how should you price it? To help you decide we can look at how your competitors in the same category price their apps.
Challenge: What is the median price price for a paid app? Then compare pricing by category by creating another box plot. But this time examine the prices (instead of the revenue estimates) of the paid apps. I recommend using {categoryorder':'max descending'} to sort the categories.
print(f"The median price for a paid add is ${df_paid_apps.Price.median()}")
The median price for a paid add is $2.99
box = px.box(df_paid_apps, x='Category', y='Price', title='Price per Category', log_y=True, height=500)
box.update_layout(yaxis_title='Paid App Price')
box.update_xaxes(categoryorder='max descending') # special ordering of the boxes -> look it up for all the category orders
box.show()
Some categories have higher median prices than others. This time we see that Medical apps have the most expensive apps as well as a median price of \$5.49. In contrast, Personalisation apps are quite cheap on average at \\$1.49. Other categories which higher median prices are Business (\$4.99) and Dating (\\$6.99). It seems like customers who shop in these categories are not so concerned about paying a bit extra for their apps.

In this lesson we looked at how to:
How to use .info() or .dtypes to get the types of data we're dealing with
Pull a random sample from a DataFrame using .sample()
How to find duplicate entries with .duplicated() and .drop_duplicates()
How to convert data types with Series.astype(str), Series.astype(int) and pd.to_numeric(Series)
How to use plotly to generate beautiful pie, donut, and bar charts as well as box and scatter plots: .pie(), .bar(), .box(), .scatter(), .update_layout(), .update_traces() for pie charts mainly, .update_xaxes()
How to use the .display() and .HTML() functions (to import first)
How to use build formatters and use it to display integers in a certain way, for instance with thousand separators
How to render or transform a Series with a special format df_apps_clean.Installs.map('{:,}'.format), knowing it has become a String Series though
How to use the string methods .str.split() and .str.replace() with Series, and define regex= True or False
How to use expand=True with the .split() method to create separate columns instead of
How to use nunique() to find the number of unique values in a Series
How to use agg() with more than one argument, e.g. df_apps_clean.groupby('Category').agg({'App': 'count', 'Installs': 'sum'})
How to .groupby() more than one category at the same time, e.g. df_apps_clean.groupby(["Category", "Type"]).agg({'App': 'count'}) and how to use as_index=False to have the grouped category as columns and not as index
How to use the .stack() method to stack from columns to index.
How to use DataFrame.index.nlevels to know the number of levels of an index, and to use DataFrame.reset_index()
How to use a double backslash \\ for dollar signs in markdowns.